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RECONSTRUCTING THE ENERGY LANDSCAPE OF 
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Defining the energy function as the negative logarithm of the den- 
sity, we explore the energy landscape of a distribution via the tree of 
sublevel sets of its energy. This tree represents the hierarchy among 
the connected components of the sublevel sets. We propose ways to 
annotate the tree so that it provides information on both topological 
and statistical aspects of the distribution, such as the local energy 
minima (local modes) , their local domains and volumes, and the bar- 
riers between them. We develop a computational method to estimate 
the tree and reconstruct the energy landscape from Monte Carlo sam- 
ples simulated at a wide energy range of a distribution. This method 
can be applied to any arbitrary distribution on a space with defined 
connectedness. We test the method on multimodal distributions and 
posterior distributions to show that our estimated trees are accurate 
compared to theoretical values. When used to perform Bayesian in- 
ference of DNA sequence segmentation, this approach reveals much 
more information than the standard approach based on marginal pos- 
terior distributions. 



1. Introduction. The concept of a distribution is fundamental in many 
parts of modern science. In statistics we may model a set of observed data 
by assuming that they are sampled from a distribution specified up to some 
parameters, and then estimate the parameters based on the empirical data. 
Furthermore, if we use a Bayesian approach to statistical inference, then our 
knowledge of the parameters given the data is contained in the posterior 
distribution. In physics the Boltzmann distribution of a system in thermal 
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equilibrium at temperature T is 



(1) 



p(x;r) 



1 



exp(-/i(x)/r), 



Z{T) 



where ^(x) is the energy and Z{T) = J exp(— /i(x)/T) dx < oo is the normal- 
ization constant. Here the density (1) is defined with respect to a measure, 
so that discrete cases are covered by the use of counting measure. To unify 
terminology, we define the energy function of a distribution /(x), which may 
be unnormalized, as 



One can view /(x) as a Boltzmann distribution with energy h{x) at tem- 
perature T = 1. 

In many situations the distribution of interest is completely specified in 
the sense that we know how to compute /(x) for any x. However, in general, 
knowing the distribution in this way does not allow us to understand the 
information it embodies. To understand the nature of the distribution, we 
must seek answers to a multitude of questions, such as what is the expecta- 
tion of a certain function ^(X) when a random variable X is drawn from this 
distribution, where is the mode of the distribution and how dispersed is the 
distribution around it, and are there multiple regions with high probabilities 
that are well separated in the sample space? 

Before the development of modern numerical computing on digital com- 
puters, it was not possible to answer any of these questions except in very 
special cases, such as when /(x) is a multivariate normal distribution. With 
the emergence of computers in the mid-twentieth century, physicists devel- 
oped several Monte Carlo algorithms that allow the generation of samples 
from /(x) numerically. In particular, Markov Chain based methods, such as 
the Metropolis-Hastings algorithm [Metropolis et al. (1953) and Hastings 
(1970)], can be applied to sample from a distribution in a very high di- 
mensional space. Later, when statisticians adapted it to applications in 
Bayesian inference [Geman and Geman (1984); Tanner and Wong (1987); 
Gelfand and Smith (1990)], Monte Carlo sampling quickly became a popu- 
lar means to extract information from a posterior distribution. 

In principle, the availability of a large sample will allow us to understand 
the nature of the distribution by the use of standard data analysis tools. 
For example, the expectation of g(X.) can be estimated by the sample aver- 
age and the distribution of g(X.) can be approximated by the corresponding 
histogram. Although powerful, such approaches can only provide limited in- 
formation, as illustrated by the following example. The energy function h{x) 
[equation (2)] in this example has seven local minima, as indicated by red 
numbers in Figure 1(A), and the global minimum is located at the origin. It 
is very hard to recover the seven modes from any projection of samples from 
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Fig. 1. An illustrative example. (A) The contour plot of the energy function of a 2-D 
distribution with seven modes indicated by red numbers. The black numbers are energy 
levels of the contours. (B) The tree of sublevel sets of the energy function. 

this distribution. However, the topology of the interiors of various contours 
(sublevel sets) actually contains information about the local modes. At very 
low energy levels, the contours and their interiors form disconnected solid 
disks, each containing a local minimum. Then modes 2-4 become connected 
at energy level /i(x) = 12, and so do modes 5-7. At this level, the interiors of 
the contours become three disconnected regions. For an energy level > 20, 
the interior of the contour is completely connected, which links the three 
groups of modes together. Such information can be summarized by a tree of 
sublevel sets of the energy function [Figure 1(B)], in which terminal nodes 
represent the local minima and internal nodes give the energy at which the 
modes become connected (energy barriers). Such a tree was first studied 
by Hartigan (1975, 1981) in statistics and it is also related to the concept 
of a disconnectivity graph in chemical physics [Becker and Karplus (1997)]. 
Please see Section 2 for a rigorous definition. 

In this paper we propose a general method to estimate the tree of sublevel 
sets from Monte Carlo samples. We focus on the application of this method 
in understanding the energy landscape of a posterior distribution in Bayesian 
inference. This paper is organized into seven sections. Section 2 defines the 
tree of sublevel sets for a distribution. In Section 3 we develop an algorithm 
to estimate the tree and present related theoretical results. The method 
is tested on multimodal functions and posterior distributions in Section 4. 
A detailed application of this method in the Bayesian inference of DNA 
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Fig. 2. The definition of the tree of sublevel sets. (A) A hypothetical energy function. 
(B) The tree of sublevel sets of the energy function. In this tree internal nodes Ui (root) 
and U2 represent two energy harriers, while U3,it4 and us are local minima. 

sequence segmentation is presented in Section 5. The paper is concluded with 
discussions in Section 6. Mathematical proofs are provided in the Appendix. 

2. The tree of sublevel sets. Consider a continuous energy function /i(x) 
of a distribution (2) defined on a connected space X with (global) min- 
imum uq. For any u > uq, suppose the sublevel set A{u) = {x | /i(x) < 
u} contains a finite number K{u) of connected components {j4fc(u) | k = 
1, . . . ,K{u)}. We may simply call them components if the meaning is clear 
from the context. As pointed out by Hartigan (1975, 1981), the collec- 
tion A = {Ak{u) \ k = 1, . . . , K{u),u > Uq} has a hierarchical structure: For 
any two sets Ai{ui) and Aj{u2) with ui < U2, either Ai{ui) C Aj{u2) or 
Ai{ui) r)Aj{u2) = </)■ One can represent such hierarchy by a tree. The root 
of the tree is defined at the energy level ui = 'mi{u\A{u)isconnectedand 
nonempty}. If A{u) is connected for all u > uq, then ui = uq and A{ui) 
is empty, which results in a terminal node. Otherwise, A{ui) is discon- 
nected with K(ui) > 1 components and it represents an internal node. We 
further define its fcth child node at the energy level uik = mi{u\A{u) fl 
Ak{ui) is connected and nonempty} for k = 1, . . . , K{ui). Recursively ap- 
plying the above definition to each internal node defines the tree of sublevel 
sets. The leaves (terminal nodes) of the tree correspond to the local minima 
of /i(x) and the internal nodes correspond to the energy barriers that sepa- 
rate the minima. See Figure 2 for an illustration. Such a tree was also called 
the cluster tree in the context of clustering analysis [e.g., Stuezle (2003)]. 

To provide further information on the energy landscape, we propose to 
annotate the tree by local density of states. The density of states f7(n) is 
a function of energy defined as the derivative of the volume of A(u) with 
respect to u: 
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where l(-) is the indicator function. By definition, the infinitesimal volume of 
the level set {x | /i(x) € — du, u)} is il.{u) du. Similarly, for each component 
Ak{u)^ we define 



as the local density of states for A; = 1, ... , K{u). Obviously, = '}2k ^k{u)- 
From local density of states one can readily compute many statistical prop- 
erties of a local minimum at different temperatures. Suppose x.^ is a local 
minimum of /i(x) whose parent on the tree of sublevel sets is B. Here we use 
B to denote the node on the tree as well as its energy. For any u > /i(xm) 
there exists a unique integer k G {1, . . . ,K{u)} such that x^, G A/^^u), and 
we denote this integer by k{u,Xm)- Let i = k{B,Xm)- Then Ai{B) defines a 
unique local domain of the minimum before it reaches any energy barrier. 
For example, in Figure 2(B), the unique local domain of the node M4 corre- 
sponds to the branch between the nodes U2 and U4 on the tree. It will be 
informative to compute expectations over such local domains with respect 
to the Boltzmann distribution (1). For instance, the probability of visiting 
Ai{B) at temperature T, which is called the probability mass of the local 
minimum x.^ hereafter, can be computed by 



which only involves one-dimensional integrals. A similar formulation may 
be adapted to calculate marginal likelihood in Bayesian model selection 
as in Skilling (2006). Note that both the tree of sublevel sets and the lo- 
cal density of states are independent of temperature T and intrinsically 
determined by the energy function. In statistics, this implies that once 
they are estimated for a distribution /(x), we can use them to calculate 
expectations and describe the energy landscape for a tempered distribu- 
tion, [/(x)]-*^/-^ oc exp(— /i(x)/T), or a truncated one, exp(— (/i(x) Vi/)) = 
exp(— max{h{x.) , H)) . 

3. Estimation of the tree of sublevel sets. Now we turn to the central 
question of this article: How to construct the tree of sublevel sets based on 
Monte Carlo samples from a distribution /(x)? It is practically impossible 
to obtain information about minima and barriers at high energy levels if we 
only have samples from the target distribution /(x). To construct a reason- 
able estimate of the high energy portion of the tree, we need to generate 
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samples from tempered versions of the distribution as used in parallel tem- 
pering [Geyer (1991)], that is, f{x;T) oc exp(— /i(x)/r), or from tempered- 
truncated versions as in the equi-energy sampler [Kou, Zhou and Wong (2006)], 
that is, /(x; T, H) oc exp(— (/i(x) Vff)/T). In what follows we assume that we 
have generated samples from a sequence of tempered or tempered-truncated 
distributions and for each sample we have recorded its energy /i(x). Estimat- 
ing the tree is then equivalent to partitioning all the Monte Carlo samples 
into the components of various sublevel sets. Given ui < U2 < • • • < um, 
we define M level sets (energy rings), C^™) = {x | /i(x) € [um.^i,Um)} for 
m = 1, . . . , M, where uq = — oo. 

3.1. Connected components of level sets. Let us use the example in Fig- 
ure 1 to motivate our algorithm. In this example the energy values of 
mode 1 and the other six modes are and 2, respectively. Let Um = ni for 
m = 1, . . . , 70. If we have partitioned the sublevel set A{um-i) = Ur<m-i ^^^'^ 
into its connected components {Aj{um~i) \ j = 1, ■ ■ ■ ,K{um,-i)}, then there 
exist only three possibilities to induce the partition of A{um) from the com- 
ponents of C^"^\ denoted by {cl"^^}. First, C^"^^ is not connected to any 
components of A{um~i), which implies that it represents a terminal node 
(local minimum), such as the components containing minima 2 to 7 of the 

level set C^^^ = {x | /i(x) G [2, 3)}. Second, Cj-™^ is connected to a single com- 
ponent Aj{um-i), i € {1, . . . , K{um-i)}, such as any level set component on 
the branch between mode 1 and its parent (the barrier at energy = 20). 

Third, C^"^^ is connected to multiple components of A{um~i) and it corre- 
sponds to a barrier on the tree, for example, C^^^^ = {x | h(x) € [20,21)}. 
Clearly, the components of level sets serve as the building blocks for an 
inductive construction of the tree. 

3.2. The main algorithm. Define an empirical level set C^"^'^ and sublevel 
set by the collections of samples in C^™) and in A{um), respectively, 
for ui < U2 < • • • < ujv/. A (connected) cluster of a set of samples generated 
in C ^ is defined as the maximal subset of the samples in a connected 
component of D. Given a metric of the space, we employ single-linkage clus- 
tering (SLC) to partition an empirical level set into clusters. SLC recursively 
merges two closest subsets of samples according to the nearest neighbor dis- 
tance (NND) between them. Define the maximum NND of a set of samples 
by the NND between the two subsets that are merged at the last step in 
the SLC. Based on NNDs and subset sizes, we develop statistical meth- 
ods to identify clusters in MP and in a discrete space with details given in 
Sections 3.4 and 5.3, respectively. 

As illustrated in the previous subsection, we can construct the tree of sub- 
level sets by partitioning samples into clusters of empirical sublevel sets via 
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a bottom-up induction. Thus, we cah this method the bottom-up partition 
(BUP) algorithm, as outlined below. 

1. Initialization: Perform SLC on C^^^ to obtain clusters {C'^^^jf^ and the 
respective maximum NNDs of these clusters {4^^ if'- Let {i^^^lf^ = 

{C^^^lfi be the clusters of . 

2. Induction: For m = 2, . . . , M: 

(a) Perform SLC on C^"^^ to obtain clusters {Cj^^^jf™ and their maxi- 
mum NNDs {rj""^}; 

(b) Connect C^"^^ and A^"^ if the NND between them is < max(rj-™'\ 
^(m-i)^ for i = 1, . . . , i^;;, and j = 1, . . . , Km-i] 

(c) Merge the resulting connected clusters to obtain {A^f^^ }f *" , the clus- 
ters of A^'^'^ , and update their maximum NNDs d^^^ = maxjrl'"^ , 
^(m-i) ^(—1) c } for A: = 1, . . . , i^^. 

If the distribution /(x) is defined on a finite number of disconnected re- 
gions, this algorithm may build multiple trees, each for a connected compo- 
nent of the domain. Note that in step (2b) multiple C^"^^ 's may be connected 
to the same Aj^ ^\ This happens when different connected components of 

C^"") belong to the same component of A^"^\ 

If rtrn. is an estimated density of states for the mth level set, the algorithm 
also provides a simple way to approximate the local density of states: 

(4) Clm,k = V^^"^ -^m for A: = 1, ... , K^, 

where n^™) and n^™^ are the sample sizes of the level sets (7^™) and U{C'f™^ | 
(ji^) A^i^^}, respectively. The estimation (4) follows immediately from 
the definition of density of states and the fact that samples in a level 
set are approximately uniform. In this work density of states is estimated 
by the iterative approach implemented in the equi-energy (EE) sampler 
[Kou, Zhou and Wong (2006), Section 4], which is also applicable to sam- 
ples generated by parallel tempering. 

3.3. Theoretical considerations. Define a discretized version of the tree 
of sublevel sets at discrete energy levels {um}m=i by the tree that repre- 
sents the hierarchy among the collection {Ai^{um) \ k = 1, . . . ,K{um),Tn = 
1, . . . ,M}. Intuitively, one may imagine to use M horizontal lines at energy 
levels ui < U2 < • ■ • < um to intersect the original tree. Each intersection 
represents a component Ak{um) of the sublevel set A{um) {I < m < M). 
Then we use a line segment to link Aj{um.-i) to A}^[um) if and only if 
Aj{um-i) C Ak{um) for m = 2, . . . ,M. The resulting graph is the discrete 
tree which can be viewed as an approximation to the original tree. 
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Definition. Given ui < U2 < ■ ■ ■ < um, let 7^ = {A^ \k = I,. . . , Km, 
m = 1, . . . , M} represent a tree constructed from an empirical sublevel set 
i(A^) of size n. We say that 7^ is consistent if the following statements hold 
in probability for all m as n ^ oo: 

(i) K{um); 

(ii) supxg^^.(u^) d(x, y4^™^) — > 0, where d{'K,A) is the minimal distance 
from X to the set A; 

(iii) if "') C if ) if and only if A,{um-i) C Ak{um). 

The BUP algorithm will build a consistent tree if we have the following: 
(1) SLC on (7^"^) can provide a consistent estimate of the components of 
C^*") in the sense of (i) and (ii) in the above definition; (2) cf ^ and if 
can be connected consistently in step (2b). Some theoretical considerations 
for the verification of these two conditions are provided. 

Lemma 1. Let /(x) be a continuous density on X <zW . Suppose D d X 
is a compact subset with a connected interior and /(x) > Vx G D. If 
an f -irreducible Markov chain {X^} with invariant distribution f is Harris 
recurrent, then the maximum NND of Dn = {Xj | X^ G D, t = 1, . . . , n} and 
supxg£)0 (i(x, Z)„) converge to almost surely as n^oo. 

The proof of this lemma is given in the Appendix. As discussed in Tierney 
(1994), most MCMC algorithms, such as irreducible Gibbs samplers and 
Metropolis algorithms, are Harris recurrent under mild conditions, to which 
Lemma 1 applies. Note that any nonempty C^*"-* has compact closure {x | 
/i(x) G [um-i,Um]}, similarly for A{um)- If an empirical level set (7^™^ is 
generated by multiple Harris recurrent Markov chains with invariant dis- 
tributions strictly positive on C^™-*, such as tempered or truncated target 
distributions, then condition (1) will be satisfied if the distance between any 
two components of C^™) is positive. 

Lemma 2. Suppose /(x) and D satisfy the same conditions in Lemma 1. 
A random sample {X,} of size n is drawn from f and SLC is performed 
on Dn = {Xj € D} with a distance threshold p/n. For some p, there exists 
a big cluster that includes a positive fraction of Dn and passes within En 
of every element of Dn, and every other cluster has diameter (maximum 
within- cluster distance) < where Sn^O in probability as oo. 

Lemma 2 is a mild modification of Theorem 1 in Hartigan (1981). If cf ^ 
and are connected and each has a positive volume, then there is a 
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distance threshold p/n with which SLC produces a big cluster that contains 
fractions of samples in both sets, while every other cluster is arbitrarily 
small. This implies that p/n < max(r|'"\ dj"* and P{Rn < p/n) — 1 as 

n oo, where i?„ is the NND between C^"^^ and Aj^ ^\ Thus, condition 
(2) for the consistence of the algorithm will be satisfied with an i.i.d. sample 
as the input. If {X^} is a Markov chain as stated in Lemma 1, one can apply 
the BUP algorithm to a subsequence {Xj- \ i = 1, . . . ,n} with (ti+i — ti) 
sufficiently large such that this subsequence behaves like an i.i.d. sample 
from /. 

3.4. Clustering level sets in W^. Given that samples in a level set are 
approximately uniform if (um — ^m-i) is small, we consider the following 
results, which are proved in the Appendix. 

Lemma 3. Let h{x) be a continuous function inMP. Suppose a connected 
level set C = {x | /i(x) € [n — Au,u)} has a finite volume Vc > and a 
random sample of size n, {Xi, . . . ,X„}, is drawn uniformly on C. Let n 
oo. 

(1) Denote by ri the NND between Xj and the other sample points. Then 
nrf/Vc follows an identical exponential distribution (denote its mean by 6) 
and is independent of nrj/Vc (j i^i)- 

(2) Denote by d the NND between a finite subset X* = {Xj^ , ■ • ■ , Xj^ } 
and the other sample points. Then given X*, nd^ /Vc follows an exponential 
distribution with mean 6 / 13{(3 > 1). 

This lemma suggests that the NNDs in the SLC on an i.i.d. uniform 
sample decay exponentially fast in the order of {Vc/n)^/P for points within 
a connected component, which become significantly smaller than between- 
component distances (BCD) when the sample size is large. Thus, one may 
treat BCDs as outliers in an exponential sample and develop methods to 
detect them. Suppose that a random sample {Yi, . . . ,Yn} is drawn from 
Exp(0) with mean 9 and that the largest k observations are missing. Denote 
the observed order statistics by < • • • < y[n-k)- Then the observed data 
likelihood is 

n—k 

L{^\y{i),---,y{n-k)) = bw{-y{n-k)/0)t Yl ^~^exp(-y(j)/6'), 

i=l 

which leads to the MLE of 9: 

, _ E7=i Vji) + ky(n-k) 
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Let Ui = nrf with (i = 1, . . . ,n) the NNDs in the SLC of an i.i.d sample 
of (n + 1) points from a level set C. Suppose that C has K + 1 connected 
components and the NNDs among them are d(^x) > ■ ■ ■ ^ > 0. As n — > oo, 
we have approximately 

[ 9, '~ for k>K. 

This suggests that one may estimate K by the value of k from which 6k starts 
to stabilize. In order to choose a more interpretable threshold, we define 
Pk (xl/9k for k = 0,1, . . . , -fCmax — 1 and normalize Pk such that J2k = 1; 
where i^max ^ is a pre-determined maximal number of components of a 
level set. Consequently, P^ increases to a level slightly above 1/i^max with 
the increase of A;, as illustrated in Figure 3. For this particular level set with 
seven components, Pq, . . . ,P^ <^ l/-fCmaxi while the other P^s are all very 
close to and slightly above l/-fCniax- Such a pattern allows us to define two 
bounds for the number of components, Ki, = 1 + minjA; | P^. > (5^ • 1/i^max} 
for b = L,H, in which < 5l < Sh < 1- From the SLC of an empirical level 
set , we first obtain Kl clusters. If we gradually increase the number of 
clusters from Kl to Kh, {Kh — Kl) clusters will be split sequentially. We 
discard a resulting daughter cluster from a split if it contains < A^min points. 
The remaining ones form the clusters {C'|'™'^}f^'" for the level set. Such a 
cluster either contains more than Amin points or its NND is among the 
largest Kl — 1- This procedure rules out those small and often false clusters 
with moderate between-cluster distances. From our empirical studies, this 
approach seems to work very well even for samples generated from a Markov 
chain when the sample size is reasonably large. 

3.5. Practical issues. First, the complexity of the BUP algorithm is dom- 
inated by SLC of empirical level sets. If the samples are generated by an 
MCMC method, we typically resample without replacement about 20% of 
the samples as the input. This can reduce the computation greatly with- 
out degrading the performance. We divide the samples into enough level 
sets such that the size of each set is in the order of 5K to lOK. Second, 
the default values for the parameters in level set clustering are specified as 
Sl = 0.5, 6h = 0.95, ATjnax = 100 and Amin = 50. These values are used in 
all the examples presented in this article. We note that, for a level set of 
size 5K or more, the performance of the algorithm is not sensitive to the 
choice of these parameters. For some distributions, the algorithm tends to 
underestimate the number of components (K^) when a level set is close 
to an energy barrier since the between-cluster distance tends to be small. 
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Fig. 3. The log(Pfe) (solid dots, k = 0, ... ,99) of an empirical level set constructed from 
MCMC samples of a 5-D mixture normal distribution. This level set has seven components. 
The height of the horizontal solid line = log(l/i(rniax) = — log(lOO). 

Considering that another reasonable upper bound for the number of compo- 
nents of C^™-* is Km-i, the estimated number of components of A{um-i), if 
no new minima occur, we modify our upper bound to be m.ayi{K h , Km-i) ^ 
where Kh is the original bound defined by 5h . Finally, a more efficient way 
to determine whether C^^'^ and should be connected in step (2b) of 

the BUP algorithm is to sequentially compute the distances between c\^^'^ 
and different level sets in A^^^ ^'^ in the descending order of their energy. 
We stop the computation once we identify a level set whose NND to c\^'^ 

IS < max(rj' ,a- ). 

The BUP algorithm may take as input from a variety of Monte Carlo 
methods besides the EE sampler and parallel tempering. The multicanonical 
sampling [Berg and Neuhaus (1991)] and related methods 
[Hesselbo and Stinchcombe (1995); Wang and Landau (2001); Liang (2005, 
2007); Atchade and Liu (2006); Liang, Liu and Carroll (2007)] can generate 
samples at various energy levels and estimate density of states. The outputs 
from these algorithms should be suitable for our method to estimate the 
tree of sublevel sets as well. With slight modifications, the nested sampling 
proposed by Skilling (2006) may be another candidate sampler to produce 
inputs for the BUP algorithm. 

4. Examples in a continuous space. To demonstrate the use of the BUP 

algorithm in constructing the tree of sublevel sets, we test it on posterior 
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distributions given multivariate t data and a multimodal distribution with 
many local modes. 

4.1. Posterior inference from multivariate t data. Suppose we have ob- 
served a random sample Y = {yi, . . . , yjt} from a multivariate t distribution 
tiy{fx, S) with known degree of freedom and scale matrix, = 5 and S = Ip, 
where p is the dimensionality of the distribution. With a flat prior, our goal 
is to make inference on the location parameter fj, from its posterior distri- 
bution given the data Y. In this case, the energy function of the posterior 
distribution is 

. n 

(5) hi^l) = -log[P(A.|Y)] = '^Y.^og 

up to an additive constant. Since multimodality of a t-likelihood occurs with 
an appreciable chance for a small sample size, we design an observed data 
matrix (n = 6) as 

A A ax ax " 
A A ^ ^ ax ax 
a2 02 A A 
A A 02 02 ' 
as 03 A A 
03 03 A A. 

where A^ oj > (j = 1, 2, 3). Note the heterogeneity of the observed data 
in the sense that they form three sub-groups, each composed of two data 
points, in the 6-D space. 

We first set A = 40 and ai = 02 = as = 4 so that the data set is composed 
of three symmetric pairs of points. To set up the energy ladder for the EE 
sampler, we did the following pilot study on the posterior distribution. We 
randomly generated 100 points from the hyper-cube defined by the boundary 
values of the observed data, that is, [0,40]^, evaluated their energy (5), and 
chose the maximum as the upper bound for the energy ladder. We then 
performed a few runs of gradient-based minimization to obtain the energy 
values of some local minima, and set the lower bound for the energy ladder as 
the smallest local minimum minus 3. In this way, the energy ladder was set 
geometrically between [166, 220] and the temperature ladder between [0.2, 4] . 
Note that a higher temperature for this target distribution would cause an 
improper posterior. The combination of the energy and temperature ladders 
allowed the EE sampler to generate samples in a wide energy range from 
local minima to high energy barriers. We utilized 10 chains, each generating 
200K samples, and resampled 20% of the samples to estimate the tree with 
M = 50 level sets. This computation was repeated 10 times, each with an 
independent input of EE samples. The topology of the constructed tree. 
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Fig. 4. Estimated trees of sublevel sets for the posterior distributions given the t data: (A) 
symmetric data set, (B) asymmetric data set without local sampling, and (C) asymmetric 
data set with local sampling. Nodes are indexed in correspondence among the three trees. 



Table 1 

Critical energy estimation for the posterior distributions given the t data 



Symmetric Asymmetric 



Node 


Estimate (s.e.) 


Approx. 


Estimate (s.e.) 


Approx. 


Ml 


169.20 (0.015) 


169.18 


162.34 (0.003) 


162.33 


Ma 


169.20 (0.007) 


169.18 






Ms 


169.20 (0.006) 


169.18 


166.59 (0.008) 


166.57 


Mi 


169.20 (0.009) 


169.18 


166.58 (0.009) 


166.57 


Ms 


169.20 (0.006) 


169.18 


169.62 (0.007) 


169.60 


Me 


169.21 (0.012) 


169.18 


169.62 (0.008) 


169.60 


Bi 


171.0 (0.075) 


170.9 






B2 


171.0 (0.084) 


170.9 


167.0 (0.018) 


167.0 




171.0 (0.074) 


170.9 


171.4 (0.040) 


171.3 




197.5 (0.273) 


198.5 


198.5 (0.893) 


200.5 



exactly identical among different EE samples, is shown in Figure 4(A) with 
critical energy values reported in the left column of Table 1. The estimated 
tree of sublevel sets is composed of three long branches for a wide range of 
energy levels from B4 = 197.5 to Bi = 171.0 {i = 1,2,3). Further down the 
energy level, each of these branches splits into two symmetric local minima. 
Each local minimum Mj is located near a data point for i = 1, . . . , 6. To 
verify this constructed tree, we utilized a gradient-based local search from 
many random initial values, including the six data points to identify local 
minima, as reported in the column "Approx." in Table 1, which gave us 
exactly the same six local modes as identified on the tree. Furthermore, 
we approximated the energy barriers by finding the maximal energy along 
the line segment between every pair of local minima. These approximated 
barriers are all very close to what we have obtained on the tree (Table 1). 
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From the estimated local density of states, we computed that the proba- 
bility mass of each local mode (3) is only 0.015, while the total probability 
of the three long branches between B4 and Bi [i = 1,2,3) is around 0.91. 
This suggests that the three within-component averages of the empirical 
sublevel set ^(.64) form a good representation of the posterior distribution 
[Figure 4(A)]. It is definitely better than the overall posterior mean which 
is located outside of any high probability region. It is also more appropriate 
than the six local modes which seem to "overfit" the small highest proba- 
bility regions with posterior probability < 0.1. Clearly, the tree of sublevel 
sets provides much more information to interpret the posterior distribution 
than marginal averages and posterior modes. 

Next we reset ai = 2 and 02 = 3 to obtain an asymmetric data matrix 
in (6). After a similar pilot study on the energy function (5) to set the 
energy ladder in [160, 220], we applied the EE sampler followed by the BUP 
algorithm with exactly the same parameter settings for 10 independent runs. 
There are only three branches on each of the constructed trees [Figure 4(B)], 
with a global minimum (Mi^2) near the center of yi and y2. The other two 
local minima Mjj [(i,j) = (3,4), (5,6)] show larger variability: they are close 
to either or . This implies that we may need to generate more samples on 
these two branches to refine our estimates. If we naively increased the sample 
size in the EE sampler, it would cause a much heavier computational burden 
on tree construction without any obvious improvement for the problem, since 
the local density of states on the branch [Mi. 2,-64] is exponentially larger 
than those on the other two branches. However, with a coarsely estimated 
tree, one can design a more efficient way to refine the local sampling of a 
branch. Given an energy level H* between B^^ and M^. for k = (3, 4) or (5, 6), 
we wish to restrict the EE sampler to the connected component Ak{H*) that 
contains the minimum Mj.. This can be achieved by defining a modified 
energy function, 

^ loo, otherwise, 

where B{Mk,dk) is the ball centered at with radius dk- Based on the 
coarse tree, we chose and H* such that A{H*) n B{Mk,dk) = Ak{H*). 
Then we performed EE sampling from the two modified local energy func- 
tions for M3^4 and Ms^g, respectively, with 5 chains of 200K samples. The 
energy ladders were set between (M^ — 2) and H* = (0.3Mfc + 0.7i?4). Here 
we use Mk to denote as well the energy of the local minimum. Other pa- 
rameters for sampling and tree construction were identical to the previous 
calculations. This refined tree estimation with local sampling was performed 
for 10 independent runs based on the estimated coarse trees. The refined tree 
is shown in Figure 4(C) with critical energy values reported in the right col- 
umn of Table 1, which are consistent with gradient-based approximations. 
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Now one sees that two small minima have been recovered on each of the 
refined branches. Note that by the volume of Ai^{H*), one can renormal- 
ize the local density of states estimated from local sampling to obtain its 
corresponding value on the coarse tree, and thus perform related probabilis- 
tic calculation. For instance, the probability masses are estimated to around 
1.2 X 10"'* for M3 and M4, and around 1.7 x 10"'* for M5 and Mq. This exam- 
ple demonstrated the flexibility and power to manipulate posterior sampling 
with the aid of the tree of sublevel sets. 



4.2. The Rastrigin function. We further test our method on a distribu- 



tion with a large number of local modes. Let x = [xi, , 
function [Gordon and Whitley (1993)] is defined as 



, Xp] . The Rastrigin 



(7) 



i=l 



P 

p - ^ cos(7rxj) 
1=1 



where A is a positive constant and p is the dimensionality of the variable x. 
This is one of the benchmark functions used to test a global optimization 
algorithm such as the genetic algorithm [Holland (1975)]. Although closely 
related, our purpose of constructing the tree of sublevel sets is more chal- 
lenging than global optimization. We take A = 2 and p = 4 in (7) to obtain 
an energy function with 3^ = 81 local minima formed by all the elements of 
the product set {-1.805, 0, +1.805}^. These minima have five distinct energy 
values shown in the theoretical tree [Figure 5(A)], dependent on the combi- 
nations of their coordinates. Correspondingly, we group them into five layers 
so that the jth layer contains local minima whose coordinates are composed 




14.49(16) 

..>^ 10.87 (32) 

8.74 



..>»7.24 (24) 
5.11 

3.62 (8) 



3.76 

^^7.39 (23,9) 
5,13 



3.67 (8) 



(B) 



Fig. 5. The trees of sublevel sets of the Rastrigin function. (A) Theoretical tree; (B) 
Estimated tree. The critical energy values are labeled on the trees and the number of minima 
in each layer is indicated in the parentheses. 
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Table 2 

Local minima and barriers of the 4-D Rastrigin function 





Layer 


1 


2 


3 


4 


5 




Count 


1 


8 


24 


32 


16 


A 


Energy 





3.62 


7.24 


10.87 


14.49 




Barrier 




5.11 


8.74 


12.36 


15.98 




Count 


1 (0) 


8(0) 


12.1 (2.96) 


0(0) 


0(0) 


B 


Energy 


0.005 (0.003) 


3.70 (0.070) 


7.62 (0.107) 








Barrier 




5.13 (0.085) 


8.54 (0.318) 








Count 


1 (0) 


8(0) 


23.9 (0.32) 


0(0) 


0(0) 


C 


Energy 


0.005 (0.003) 


3.67 (0.023) 


7.39 (0.074) 








Barrier 




5.13 (0.085) 


8.76 (0.151) 







Note: "Count" and "Energy" refer to the number and the energy of the local minima in 
a layer, respectively. "Barrier" is the energy barrier between the current and the previous 
layers [see Figure 5(A)]. Theoretical values and estimated values without/with linear in- 
terpolation are given in panels A, B and C, respectively. The standard errors of estimates 
in the same layer from 10 independent EE samples are given in the parentheses. 



of (5 — j) zeros and (j — 1) ±1.805, for j = 1, . . . , 5. The theoretical values 
of the local minima and energy barriers are given in Table 2 panel A. 

We applied the EE sampler to this energy function with 20 chains, in 
which the energy was truncated evenly between [0, 19] and the tempera- 
ture was fixed at T = 0.5. Thus, the target distribution of the A;th chain 
was /fc(x) (X exp[— (/i(x) V k)/T] for A; = 0, ... , 19 and T = 0.5. We generated 
lOOK samples from each chain and resampled 20% of them to construct the 
tree with M = 50 level sets. This whole process was repeated 10 times in- 
dependently. As reported in panel B of Table 2, for all the 10 independent 
inputs of EE samples, the BUP algorithm identified all the 9 minima in 
the first two layers and about half of those in the third layer. It also de- 
tected unambiguously the energy barriers associated with these three layers 
of minima. 

There might be two reasons why our method failed to identify some high- 
energy local modes in layer three and beyond. First, the EE sampler did 
not visit them because of the tiny probability associated with these modes. 
Even for a truncated energy function, the local density of states of such a 
mode may be much smaller than that of the connected component which 
contains low-energy modes, and thus, the EE sampler has almost no chance 
to explore them. For example, the ratio of the local density of states of a 
mode in the fourth layer at energy u = 10.9 over that of the middle main 
branch which connects to lower energy nodes on the tree [Figure 5(A)] is 
approximately 3 x 10~^. Intuitively, one may think of the main branch as 
much "thicker" than the leaves of the same height. Second, it is possible that 



RECONSTRUCTING ENERGY LANDSCAPES 



17 



the EE sampler visited some of these modes, but with insufficient samples 
they were not identified by the BUP algorithm as clusters of a level set. Re- 
call that a detected cluster with a moderate NND (ranked between K]^ and 
Kh) must have at least A'^min = 50 samples, which might be too stringent 
for a small high-energy mode. This motivated us to use linear interpolation 
of the energy function to enhance the sensitivity in identifying clusters of 
a level set. Given Kl and Kh, consider splitting a cluster in-between as 
described in Section 3.4. Suppose a resulting daughter cluster D has less 
than Nmir^ points. We define for D a pair of points in the current empirical 
level set by (x*,y*) = argminxg/j^y^D (i(x,y), which determine the NND for 
agglomerating D with its closest cluster (sister cluster). Note that x* and 
y* can be obtained along with single-linkage clustering and no additional 
computation is needed. Then we evaluate the energy values of 100 points 
evenly distributed along the line segment between x* and y*. The maxi- 
mal energy of these interpolated points serves as an approximation to the 
barrier between x* and y* or between D and its nearest neighbor cluster. 
Consequently, D will not be discarded if the maximal interpolated energy is 
higher than the upper energy bound of the current level set. 

When we apphed the BUP algorithm with this linear interpolation to 
the same ten sets of EE samples from the Rastrigin function, it identified 
almost all the 33 local minima in the first three layers with reduced biases 
and standard errors (Table 2 panel C). For nine sets of samples, the method 
identified all the 33 local modes exactly. For the other one set, it missed one 
local mode in the third layer. All the constructed trees have a similar topol- 
ogy as shown in Figure 5(B). Although it failed to recover the tree topology 
for layers four and five, our approach still provided a global understanding 
of the energy landscape of this distribution since any statistical property 
of the distribution is unlikely to be affected, in practice, by ignoring those 
high-energy modes of tiny probabilities. 

5. Bayesian segmentation of DNA sequences. To account for the het- 
erogeneity of DNA sequences, statistical models have been proposed to 
segment a DNA sequence into pieces with more homogeneous nucleotide 
compositions [e.g., Liu and Lawrence (1999); Boys and Henderson (2004); 
Keith (2006)]. Bayesian inference on these models is performed via poste- 
rior sampling of segmentations (or change points). In this study we adopt 
the model of Liu and Lawrence (1999) to sample from a posterior distribu- 
tion of change points. Then we apply the BUP algorithm to reconstruct the 
energy landscape of the posterior distribution which is expected to be quite 
complicated. 

5.1. A Bayesian model. Denote a DNA sequence of length L by Y = 
. . . , yi] = yi-L-, where yi G {a, c, g^t} for / = 1, . . . , L. Assume that there 
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exist at most N change points. Let Z = [zi, . . . , Zp] {0 <p < N) denote the 
locations of change points, which segment the sequence into p + 1 pieces. 
Note that p = imphes that the whole sequence is in one segment. We 
assume that the prior distribution for the number of change points p is 
uniform on {0,1,..., N}. Given p, we assume that every possible placement 
of the p change points is equally likely in the prior. Within a segment defined 
by the change points, the nucleotides follow a multinomial distribution. We 
are interested in the posterior distribution of the locations of the change 
points Z given the sequence data Y, 

(8) P(Z|Y)oc7r(p)7r(Zb)P(Y|Z), 

where 7r(-) is used as a generic notation for prior distributions. Note that 
under a conjugate Dirichlet prior for the multinomial distribution in a seg- 
ment, -P(Y|Z) in (8) can be computed exactly by ratios and products of 
gamma functions. 

5.2. Posterior sampling. Exact sampling from the posterior distribution 

(8) can be achieved via dynamic programming as used in Liu and Lawrence 
(1999). This approach first samples from the marginal posterior distribution 
of p (the number of change points) and then samples sequentially all the p 
change points from Zp to zi. The key to this exact sampling is a recursion 
on the conditional probability of observing a partial sequence yi-j given that 
it has k change points pij = k, 

I 

P{yi:l \Pl:l = k)= P{yi:zk-l W-Z^-l = k - I) 

(9) 

X P{yz^d\Pzk:l = 0)vr(zfcbi:« = k), 

where T^{zk\pi:i = k) is the conditional prior probability to place the kth 
change point at z^, given that there are k change points between yi and 
yi. We pre-compute the probability of every subsequence yi-i (1 < i < / < 
L), given that it is generated from a multinomial distribution, that is, 
P{yid\Pi:i = 0), which can be calculated in closed-form based on gamma 
functions as we mentioned before. Then recursive forward summation (9) is 
applied to compute P{yi:i \pi:i = k) for /c = 0, . . . , and Z = 1, . . . , L. After 
all the summations, we sample the number of change points p = pi-L from 
P{pi:L = ^|Y) oc 'K{k)P{yi-L\pi:L = k). Given p, one can sequentially impute 
the change points Zp, . . . , zi based on the additive terms in the summation 

(9) . Please refer to Liu and Lawrence (1999) for more details. 

We define the energy function of Z (a set of change points or a segmen- 
tation) by 



(10) 



/i(Z) = -logP(Z|Y) 
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and a tempered distribution at temperature T by 

P(Z|Y;T) cxexp(-/i(Z)/r) = [P{Z\Y)f'^ . 

It is easy to see that one can use a similar dynamic programming method 
to sample from P(Z|Y;T) for any T > 0. Our strategy for this problem is 
to generate (independent) samples at various temperatures and use them to 
reconstruct the energy landscape of the posterior distribution. 

5.3. Metric and clustering. The space of change points is discrete in na- 
ture. We need to define connectedness in this space based on some metric. 
A natural choice of a metric between two sets of change points (segmen- 
tations) is the number of sequence positions that are partitioned into dis- 
tinct segments. Write the segments defined by Z = [zi, . . . , Zp] as {[zk-i,ziS) \ 
A: = 1, . . . ,p + 1}, where zq = 1 and Zp+i = L + 1, and the segments by 
X. = [xi, . . . ,Xg] as {[xj-i,Xj) \ j = 1, . . . ,q + 1} similarly. For a one-to-one 
map g from a subset of {1, . . . ,p + 1} into {I, . . . ,q + I}, the total number 
of common sequence positions between all pairs of mapped segments is 

5g(Z,X)= \[Zk-l,Zk)n[Xy^k)-l,Xg(k))\, 

k:g{k)^4> 

where | • | returns the number of integers in a set. We find the map that 
maximizes ^^(ZjX) and then define the metric (distance) between Z and 
X as (i(Z,X) = L — maxg ^^(Z, X). For example, if L = 10, Z = [3,9] and 
X = 8, then the desired map is g*{2) = 1 and 5* (3) = 2, which maps seg- 
ments 2 ([3,8]) and 3 ([9,10]) defined by Z to segments 1 ([1,7]) and 2 
([8, 10]) defined by X respectively, and the resulting distance is 3. Note that 
the minimal distance between two distinct segmentations is 1, which implies 
a natural way to define connectedness. We say that a set D in this space is 
connected if for any two segmentations Z^, Z;, G D, there exist m segmenta- 
tions Zi, . . . , Zm G D with Zi = Za and Z^ = Z;, such that d(Zj, Zj-|_i) = 1 
for i = 1, . . . , m — 1. 

Suppose n + 1 segmentations have been sampled in a level set C and the 
resulting NNDs in SLC are di,d2, ■ ■ ■ ,dn. Motivated by the observation that 
the histogram of dj (i = 1, . . . , n) decays exponentially if C is connected, we 
model them by a geometric distribution, P{di\P) = /3'^'(1 — /3) (dj = 0, 1, . . .), 
where f3 S (0,1) is an unknown parameter. We rank di to obtain the order 
statistics < (i(2) < • ■ ■ < d{n) • If C consists of i^T + 1 connected compo- 
nents, one expects the largest K NNDs to be significantly greater than the 
rest when n is large. By the similar missing data formulation as in Sec- 
tion 3.4, the MLE of /3 if the largest k distances are not observed is 

j:7=id^i) + f'd^n-k) + in-k)' 
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Then the mean of the geometric distribution can be estimated by 

"k — ~ — ; • 

l-f3k n-k 

From the memoryless property of the geometric distribution, the expected 
value of di given di > d^^n-k) and 6k- Denote by A'niax(= 100) the pre- 
determined maximal number of components. For k = 1, . . . , i^'max — 1 we com- 
pute 9k and define 7^ = {d(^n~k+i) ~ '^{n-k))/^k as a statistic to test whether 
the observed largest k NNDs are significantly greater than that expected 
from the other n — k distances. Let = 1 and find Kh = 1 + maxj/c | 7^ > 
a, d[n-k+i) ^ 2}, where a = 10, and we define the maximum of an empty set 
as 0. Then the same pruning procedure as described for a continuous space 
by the size of a potential cluster is applied to obtain the final clusters of an 
empirical level set. 

5.4. A simulation study. We simulated 10 DNA sequences each of length 
L = 1000 from the above segmentation model with four change points [201, 
401,601,801]. Denote the nucleotide composition of the ith segment (i = 
1, . . . , 5) by 0j = [6ii, . . . , 9i4] for a, c, g, t, respectively. For the first four seg- 
ments, 9ii = 0.4 and 6ij = 0.2 (j ^ i). For the last segment, O^j = 0.25 for all 
j- 

Set the maximal number of segments + 1 = L/lOO = 10. We found from 
a pilot study that, at r = 0.5, samples were concentrated around the global 
mode, while at T = 2 most samples were composed of 8 or 9 random change 
points. Thus, we applied the exact posterior sampling at 10 temperatures 
between 0.5 and 2 with geometric progression. For each temperature, 50K 
samples were generated. All the samples from different temperatures were 
partitioned into M = 100 level sets for estimating the tree of sublevel sets. 
The estimated trees have quite complicated structures, with an average of 
45.3 local minima and 24.7 energy barriers. The number of local minima 
ranges from 21 to 74 and the number of energy barriers ranges from 6 to 
49 for the 10 sequences. A local minimum in this discrete space is defined 
as a segmentation Z whose energy (10) is lower than the energy of all the 
neighbors (whose distance to Z equals one). For such a discrete space, it is 
feasible to verify an identified local minimum by this definition. Among the 
total of 453 detected minima on the 10 trees, 436 of them are true ones. We 
further applied steepest-descent optimization to the remaining 17 detected 
minima. It turned out that 16 of them led to distinct local minima on their 
respective trees and only one out of the 453 detected minima was produced 
by a false split of a terminal branch. These results demonstrated the high 
specificity of our tree construction algorithm. 
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We illustrate the results by one of the constructed trees. The posterior 
probabilities of the locations of change points (T = 1) are shown in Fig- 
ure 6(A). As one can see, there are many peaks along the sequence and 
it is hard to find a reasonable principle to predict change points based on 
these marginal posterior probabilities, since the information on the combi- 
natorial pattern among these potential change points is not revealed by such 
marginal statistics. However, the estimated tree of sublevel sets [Figure 6(B)] 
provides an informative way to understand this complicated posterior dis- 
tribution. At the energy level of 1388, the sublevel set of the energy function 
of this posterior distribution is composed of three disconnected valleys. The 
first valley contains two local minima (indexed as 29 and 26 in the figure), 
the second valley contains a single local minimum 27, and the third valley 
contains many local minima. We report the respective lowest minima of the 
three valleys (minima 26, 27 and 6) in Table 3, from which we see that they 
are composed of 3, 5 and 4 change points, respectively. The change points 
of minimum 6, which is the global mode, are close to the four simulated 
ones. Minimum 26 does not contain the last change point, while minimum 
27 splits the last change point into two. In addition, there exists another 
branch with a high-energy minimum 31, which has 3 change points around 
positions 400, 600 and 800. We further focus on the third valley which con- 
tains many local minima and report a few representative ones from different 
branches (minima 8, 12, 2 and 25) in the table. They all have four change 
points around the true locations, but with different combinations of local 
shift compared to minimum 6. The tree of sublevel sets definitely provided 
much more insights on the Bayesian inference of this problem: Not only did 
it detect multiple modes, but also recovered the hierarchy among them and 
revealed the combinatorial pattern among change points. 

5.5. Mouse upstream sequences. A recent study identified eight genes 
that are up-regulated and function as activators in mouse embryonic stem 
cells [Zhouet al. (2007)], namely, Oct4, Sox2, Nanog, Esrrb, Tcf7, Nr5a2, 
Otx2 and Etv5. We extracted upstream 1,500 bases to downstream 500 bases 
around the transcription start sites and call them the upstream sequences of 



Table 3 

Selected local minima of change points on a simulated sequence 



Index 


Change points 


Index 


Change points 


26 


[205, 393, 608] 


8 


[205,393,609,787] 


27 


[205,393,608,756,816] 


12 


[205,393,609,806] 


6 


[205,393,609,813] 


2 


[199,396,609,813] 


31 


[393,609,818] 


25 


[216,394,609,813] 
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Fig. 6. Results for a simulated sequence. (A) Marginal posterior probabilities of change 
point locations along the sequence. (B) The estimated tree for the posterior distribution. 
The numbers in the figure are the indices of local minima. 
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Fig. 7. Results for the upstream sequence of Nanog. (A) Marginal posterior probabili- 
ties of change point locations along the sequence. (B) The estimated tree of the posterior 
distribution with labeled valleys and minima. 



the genes. We set the maximal number of segments + 1 = L/lOO = 20 and 
apphed posterior samphng followed by the BUP algorithm with exactly the 
same parameters as we used in the simulation study. The energy landscapes 
of the posterior distributions as revealed by estimated trees exhibit very 
different characteristics for different genes. The tree of the Oct4 upstream 
sequence contains only one minimum with no change points (i.e., the whole 
sequence is in one segment), while the tree of the Esrrb upstream contains 



RECONSTRUCTING ENERGY LANDSCAPES 



23 



Table 4 

Selected local minima of change points on the Nanog sequence 



Valley 


Minimum 


Change points 


A 


6 


[225,361,1376,1460] 


A 


12 


[254,361,1376,1460] 


B 


47 


[225,361,1376,1410, 1461] 


C 


23 


[225, 307, 333, 355, 1376, 1410, 1461] 


D 


3 


[254, 307, 333, 355, 1376, 1460] 


D 


5 


[225, 307, 333, 355, 1376, 1460] 


D 


8 


[275, 307, 333, 355, 1376, 1460] 


E 


53 


[225, 307, 333, 355, 1240, 1251, 1376, 1413, 1460] 



55 local minima. On average there are 28.9 local minima and 17.6 barriers 
on the estimated trees. Among all the 231 detected minima on the eight 
trees, 227 of them are verified to be true ones, three of them lead to distinct 
minima via steepest-descent optimization, and only one of them corresponds 
to a false prediction. 

We choose the results of the Nanog upstream sequence as an illustra- 
tion. The posterior probabilities of change point locations are shown in Fig- 
ure 7(A), where likely locations of change points indicated by peaks in the 
plot are mostly distributed in the intervals [100,400] and [1100, 1500]. (Note 
that the transcription start site is at position 1500.) The estimated tree of 
sublevel sets contains 53 local minima [Figure 7(B)] and a few deep energy 
valleys (big branches on the tree) labeled as A, B,...,E in the figure. We 
select eight representative local minima labeled in the figure and reported 
in Table 4. The overall picture of the energy landscape is very clear. The 
deep valleys correspond to segmentations with different number of change 
points, such as valleys A and D which contain minima with 4 and 6 change 
points respectively. Within a valley defined in Figure 7(B), the local minima 
generally have similar combinations of change points. For example, the three 
labeled minima in valley D, each in a sub-branch of D, share five common 
change points, but differ in the location of the first one between 220 and 280 
(Table 4). 

Compared with the marginal posterior probabilities in Figure 7(A), the 
estimated tree clearly revealed much more information on the posterior dis- 
tribution. From the marginal probabilities one can only identify change 
points from the local peaks with no information to determine the combi- 
nation among them. However, the tree of sublevel sets not only identified 
different possible combinations of change points, but also organized them 
into a hierarchical structure that brings connectivity to differentiate and 
group these local minima. As illustrated, such information is very helpful 
for understanding the posterior distribution. One can view the posterior 
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distribution as a multilevel mixture. At the first level, its energy landscape 
can be roughly represented by a mixture of a few large valleys [such as the 
ones labeled in Figure 7(B)]. Each valley may be further decomposed into 
smaller sub-valleys represented by various local minima. 

6. Discussion. We have formulated the tree of sublevel sets to character- 
ize the energy landscape of a distribution, and developed the BUP algorithm 
to estimate the tree from Monte Carlo samples. The use of level sets as the 
building blocks in our method for tree construction has two advantages: (1) 
The samples in a level set are roughly uniform, which helps the development 
of algorithms to identify clusters; (2) Level set clustering requires much less 
computation as compared to clustering sublevel sets. The design of the BUP 
algorithm fits very well the EE sampler, which constructs empirical energy 
rings (level sets) for a wide spectrum. As we have mentioned in Section 3.5, 
a few other Monte Carlo methods are also good candidates for generating 
input samples with estimated density of states for the BUP algorithm. 

Similar concepts of the tree have been used independently to describe the 
energy landscape of a physical system under the name of a disconnectiv- 
ity graph [Becker and Karplus (1997)], with applications to peptide models 
[e.g., Krivov and Karplus (2002), Evans and Wales (2003)], lattice spin sys- 
tems [e.g., Garstecki, Hoang and Cieplak (1999)] and protein folding path- 
ways [Evans and Wales (2004), Carr and Wales (2005)] among others. A dis- 
connectivity graph is constructed given a database of critical states of an 
energy surface, such as local minima, transition states (energy barriers) and 
pathways from a local minimum to a transition state. Optimization methods 
are often employed to search an energy surface for its critical states, mostly 
based on gradient approaches that utilize the first and second derivative 
matrices or ad hoc approximations for specific models [see Wales (2005) for 
a review]. The BUP algorithm in this article can also be used to construct 
the disconnectivity graph of a given potential energy surface. A unique fea- 
ture of our method is that the construction of the tree is based on level 
set clustering of Monte Carlo samples and can be applied to any configu- 
ration space on which connectedness is defined through the use of a metric 
(or even a pseudo-metric). This is very important for statistical applications 
since a derivative-based search may be very difficult (e.g., for missing data 
problems) and even impossible (e.g., for discrete spaces). A firm compar- 
ison between the BUP algorithm and other chemical physics methods in 
constructing potential/free energy landscapes will be an interesting future 
direction of this work. 

We believe that, with the fast development of powerful sampling methods 
and the exponential increase in computing capacity, computational statisti- 
cal methods to extract useful information from large-size simulated samples, 
such as the BUP algorithm in this paper, are expected to play critical roles 
in many modern scientific fields. 
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APPENDIX 



Proof of Lemma 1. Since D is compact, for any e > its interior 
can be covered by a finite number of e-balls, -B(xj,e), where n-i ^ D and 
Bi = n B{^i,e) is nonempty for i = 1, . . . . Because /(x) > Vx G D 
and /(x) is continuous, every Bi has positive probabihty measure induced by 
/. The Harris recurrence of {Xt} imphes that the chain wih visit Bi infinitely 
often (i.o.) with probabihty 1, that is, P(X.t € Bi, i.o.|Xo = x) = 1, for ah 
X € A" and all i = 1, . . . , N . Since A'" is finite, 




G-B,, i.o.} 



Xn = x =1. 



Thus, there exists at least one X^ in each Bi with probability 1 as n ^ oo, 
which implies that supx^/jo d(x, < 2e. Because is connected, any 
two points in Dn can be linked by a continuous path covered by a subset 
of {Bi}. Thus, the two points will be joined in SLC with maximal NND 
distance < 4e, which completes the proof. □ 

Proof of Lemma 3. (1) Let Vi = aprf be the maximal volume of a ball 
centered at Xj that does not contain any other points, where ap is a posi- 
tive constant. Let Bi be the open ball centered at X^ with volume v/n. The 
probability P{Vi > v/n) = (1 — Ve^nc/^c)"""'^! where VBifiC is the volume 
of Bi n C. Since /i(x) is continuous, P{Bi C C) — > 1 as n — > oo and, thus, 
P{Vi > v/n) ^ (1 - v/{nVc))'^-^ e'"/^'^ . This shows that the asymp- 
totic distribution of rarf/Vc is identically exponential with mean = 1/ap 
for any i. Because P{Bi Bj ^ (j)) ^ as n — > oo, the joint probability 
P{Vi > v/n,Vj > w/n) — > e~'^'"^'^^/'^^ and, thus, nr^ /Vc and nr^/Vc are in- 
dependent asymptotically. 

(2) We put k balls of identical radius centered at Xj^ , . . . , Xj^. . Then UpdP 
is the maximal ball volume such that none of the k balls contains any points 
other than X*. Similarly, one can show that P{apdP > f/nlX*) q-P^I'^c 
as n — >■ oo, where 1 < f5 < k. Obviously, the conditional mean of ndP/Vc is 
9/(3. □ 
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